Chapter 1 

Likelihood-free Markov chain Monte 
Carlo 

Scott A. Sisson and Yanan Fan 

1.1 Introduction 

In Bayesian inference, the posterior distribution for parameters 6 G is given by ir{6\y) oc 
-n{y\6)it{6), where one's prior beliefs about the unknown parameters, as expressed through 
the prior distribution n(9), is updated by the observed data y G y via the likelihood function 
ir(y\9). Inference for the parameters 9 is then based on the posterior distribution. Except in 
simple cases, numerical simulation methods, such as Markov chain Monte Carlo (MCMC), 
are required to approximate the integrations needed to summarise features of the posterior 
distribution. Inevitably, increasing demands on statistical modelling and computation have 
resulted in the development of progressively more sophisticated algorithms. 

Most recently there has been interest in performing Bayesian analyses for models which 
are sufficiently complex that the likelihood function n(y\9) is either analytically unavailable 
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Table 1.1: The likelihood- free rejection sampling algorithm (Tavare et al., 1997). Accepted 
parameter vectors are drawn approximately from 7r(9\y). 



Likelihood-free rejection sampling algorithm 



1. Generate 6' ~ tt(9) from the prior. 

2. Generate dataset x from the model tt(x\9'). 

3. Accept 9' if x ~ y. 



or computationally prohibitive to evaluate. The classes of algorithms and methods devel- 
oped to perform Bayesian inference in this setting have become known as likelihood-free 



computation or approximate Bayesian computation (Beaumont et al. , 2002; Marjoram et al. 



. 2009; 


Sisson et al. 


2007 


Tavare et al. , 


1997) 



circumventing of explicit evaluation of the likelihood by a simulation-based approximation. 

Likelihood-free methods are rapidly gaining popularity as a practical approach to fitting 
models under the Bayesian paradigm that would otherwise have been computationally im- 
practical. To date they have found widespread usage in a diverse range of applications. 



These include wireless communications engineering (Nevat et al. 2008), quantile distribu- 



tions (Drovandi and Pettitt 2009), HIV contact tracing (Blum and Tran, 2009), the evolution 



of drug resistance in tuberculosis (Luciani et al. , 2009), population genetics (Beaumont et al. 



2002), protein networks (Ratmann et al. , 2009, 2007), archaeology (Wilkinson and Tavare 



2009); ecology (Jabot and Chave, 2009), operational risk (Peters and Sisson, 2006), species 



migration (Hamilton et al. , 2005), chain-ladder claims reserving (Peters et al. 2008), coales- 



cent models (Tavare et al. , 1997), a-stable models (Peters et al., 2009), models for extremes 



(Bortot et al. , 2007), susceptible-infected-removed (SIR) models (Toni et al. , 2009), pathogen 



transmission (Tanaka et al. 2006) and human evolution (Fagundes et al. , 2007). 



The underlying concept of likelihood-free methods may be simply encapsulated as follows 



(see Table 1.1): For a candidate parameter vector 9', a dataset is generated from the model 
(i.e. the likelihood function) x ~ tc(x\9'). If the simulated and observed datasets are similar 
(in some manner), so that x « y, then 9' is a good candidate to have generated the observed 
data from the given model, and so 9' is retained and forms as a part of the samples from the 
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posterior distribution Tr(6\y). Conversely, if x and y are dissimilar, then 6' is unlikely to have 
generated the observed data for this model, and so 8' is discarded. The parameter vectors 
accepted under this approach offer support for y under the model, and so may be considered 
to be drawn approximately from the posterior distribution 7i(6\y). In this manner, the 
evaluation of the likelihood ir(y\6'), essential to most Bayesian posterior simulation methods, 
is replaced by an estimate of the proximity of a simulated dataset x ~ tt(x\8') to the observed 
dataset y. While available in various forms, all likelihood-free methods and models apply 
this basic principle. 

In this article we aim to provide a tutorial-style exposition of likelihood-free modelling 



and computation using MCMC simulation. In Section 1.2 we provide an overview of the 
models underlying likelihood-free inference, and illustrate the conditions under which these 
models form an acceptable approximation to the true, but intractable posterior 7i(6\y). In 



Section 1.3 we examine how MCMC-based samplers are able to circumvent evaluation of the 
intractable likelihood function, while still targetting this approximate posterior model. We 
also discuss different forms of samplers that have been proposed in order to improve algorithm 



and inferential performance. Finally, in Section L4 we present a step-by-step examination of 
the various practical issues involved in performing an analysis using likelihood-free methods, 
before concluding with a discussion. 

Throughout we assume a basic familiarity with Bayesian inference and the Metropolis- 
Hastings algorithm. For this relevant background information, the reader is referred to the 
many useful articles in this volume. 



1.2 Review of likelihood-free theory and methods 



In this Section we discuss the modelling principles underlying likelihood-free computation. 
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1.2.1 Likelihood-free basics 



A common procedure to improve sampler efficiency in challenging settings is to embed the 
target posterior within an augmented model. In this setting, auxiliary parameters are in- 
troduced into the model whose sole purpose is to facilitate computations (see for exam- 
ple simulated tempering or annealing methods (Geyer and Thompson 1995 Neal, 2003[ )). 
Likelihood-free inference adopts a similar approach by augmenting the target posterior from 
tt(%) oc ir(y\9)ir(9) to 

n LF (9,x\y) <x7r(y\x,9)ir(x\9)7r(9) (1.2.1) 



where the auxiliary parameter x is a (simulated) dataset from ir(x\9) (see Table 1.1), on the 



same space as y E y ( |Reeves and Pettitt 2005 Wilkinson, 2008). As discussed in more 



detail below (Section 1.2.2), the function ir(y\x,6) is chosen to weight the posterior ir(9\x) 



with high values in regions where x and y are similar. The function 7i(y\x, 6) is assumed to 
be constant with respect to 9 at the point x = y, so that Tc(y\y, 9) = c, for some constant 
c > 0, with the result that the target posterior is recovered exactly at x = y. That is, 
KLF(9,y\y) oc 7r(y|0)7r(0). 

Ultimately interest is typically in the marginal posterior 



ttlf (9 \y) oc tt(9) / 7r(y\x, 9)-7r(x\9)dx, 
Jy 



[1.2.2) 



integrating out the auxiliary dataset x. The distribution ni jF {9\y) then acts as an approxi- 
mation to 7i(9\y). In practice this integration is performed numerically by simply discarding 
the realisations of the auxiliary datasets from the output of any sampler targetting the joint 



posterior ir LF (9 } x\y). Other samplers can target ir LF (9\y) directly - see Section 1.3.1 



1.2.2 The nature of the posterior approximation 

The likelihood-free posterior distribution 7tL F (9\y) will only recover the target posterior 
7r{9\y) exactly when the function n(y\x,9) is precisely a point mass at y — x and zero 
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elsewhere (Reeves and Pettitt 2005). In this case 



Tr LF (6\y) oc tt(0) / ir(y\x,9)ir(x\9)dx = 7r(y\9)ir(9). 
Jy 

However, as observed from Table [l~Tj this choice for 7r(y\x, 9) will result in a rejection sampler 
with an acceptance probability of zero unless the proposed auxiliary dataset exactly equals 
the observed data x = y. This event will occur with probability zero for all but the simplest 
applications (involving very low dimensional discrete data). In a similar manner, MCMC- 



based likelihood-free samplers (Section 1.3) will also suffer acceptance rates of zero. 



In practice, two concessions are made on the form of ir(y\x,9), and each of these can 



induce some form of approximation into TTLp(9\y) (Marjoram et al. , 2003). The first allows 



the function to be a standard smoothing kernel density, K, centered at the point x = y and 
with scale determined by a parameter vector e, usually taken as a scalar. In this manner 



\x,9) 



1 



-K 



\x 



y\ 



weights the intractable likelihood with high values in regions x ~ y where the auxiliary 
and observed datasets are similar, and with low values in regions where they are not similar 



(Beaumont et al. , 2002 Blum, 2009; Peters et al., 2008). The interpretation of likelihood-free 



models in the non-parametric framework is of current research interest (Blum, 2009). 



The second concession on the form of 7r e (|/|x,#) permits the comparison of the datasets, 
x and y, to occur through a low- dimensional vector of summary statistics T(-), where 
dim(T(-)) > dim(#). Accordingly, given the improbability of generating an auxiliary dataset 
such that x ~ y, the function 



ir e (y\x,9) 



l_ K f \T{x)-T{y)\ 



(1.2.3) 



will provide regions of high value when T(x) ~ T(y) and low values otherwise. If the vector 
of summary statistics is also sufficient for the parameters 9, then comparing the summary 
statistics of two datasets will be equivalent to comparing the datasets themselves. Hence 
there will be no loss of information in model fitting, and accordingly no further approximation 
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will be introduced into n^p^ly). However, the event T(x) ~ T(y) will be substantially more 
likely than x ~ y, and so likelihood-free samplers based on summary statistics T(-) will 
in general be considerably more efficient in terms of acceptance rates than those based on 



full datasets (Pritchard et al. 1999 Tavare et al. 1997). As noted by McKinley et al 



(2009), the procedure of model fitting via summary statistics T(-) permits the application 
of likelihood-free inference in situations where the observed data y are incomplete. 



Note that under the form (1.2.3), lim e _^ n e (y\x, 9) is a point mass on T(x) = T(y). Hence, 
if T(-) are also sufficient statistics for 9, then \im e ^ n LF (9\y) = n(9\y) exactly recovers the 



intractable posterior (Reeves and Pettitt 2005). Otherwise, if e > or if T(-) are not 
sufficient statistics, then the likelihood-free approximation to n(9\y) is given by n LF (9\y) in 



(1.2.2). 



A frequently utilised weighting function ir e (y\x, 9) is the uniform kernel density (Marjoram 



et al. , 2003 Tavare et al. , 1997), whereby T(y) is uniformly distributed on the sphere centered 
at T(x) with radius e. This is commonly written as 



717 



x, 9) oc 



1 iip(T(x),T(y))<e 
otherwise 



;i.2.4) 



where p denotes a distance measure (e.g. Euclidean) between T{x) and T(y). In the form 
of (1.2.3) this is expressed as n e (y\x,9) = e~ 1 K u (p(T(x),T(y))/e), where K u is the uniform 



kernel density. Alternative kernel densities that have been implemented include the Epanech- 



nikov kernel (Beaumont et al. , 2002), a non-parametric density estimate (Ratmann et al. 



2009) (see Section 1.3.2), and the Gaussian kernel density (Peters et al. 2008), whereby 



Ti e (y\x, 9) is centered at T(x) and scaled by e, so that T{y) ~ N(T(x), Se 2 ) for some covari- 
ance matrix E. 



1.2.3 A simple example 

As an illustration, we examine the deviation of the likelihood-free approximation from the 
target posterior in a simple example. Consider the case where 7r(9\y) is the univariate N(0, 1) 
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(a) (b) (c) 
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Figure 1.1: Comparison of likelihood-free approximations to the N(0, 1) target posterior 
(solid line). Likelihood-free posteriors are constructed using uniform (dotted line) and Gaus- 
sian (dashed line) kernel weighting densities ir e (y\x, 6). Panels (a)-(c) correspond to e values 

of v3, v3/2 and v3/10 respectively. 



density. To realise this posterior in the likelihood-free setting, we specify the likelihood as 
x ~ N(6, 1), define T(x) = x as a sufficient statistic for 9 (the sample mean) and set the 
observed data y = 0. With the prior tt(6) oc 1 for convenience, if the weighting function 



ir e (y\x } 6) is given by (1.2.4), with p(T(x),T(y)) = \x — y\, or if Tt € (y\x,6) is a Gaussian 



density with y ~ N(x, e 2 /3) then respectively 

*lf{0\v) oc $(e ~ g) ~^~ e ~^ an d ir LF (0\y) = N(0, 1 + e 2 /3), 

where $(•) denotes the standard Gaussian cumulative distribution function. The factor of 
3 in the Gaussian kernel density ensures that both uniform and Gaussian kernels have the 
same standard deviation. In both cases 'Klf^v) ~^ N(0, 1) as e — > 0. 

The two likelihood-free approximations are illustrated in Figure 1.1| which compares the 
target 7f(6\y) to both forms of TtLF{9\y) for different values of e. Clearly, as e gets smaller 
then 7CLF{@\y) ~ 7r(0|?/) becomes a better approximation. Conversely, as e increases, then 
so does the posterior variance in the likelihood-free approximation. There is only a small 
difference between using uniform and Gaussian weighting functions in this case. 

Suppose now that an alternative vector of summary statistics T(-) also permits unbiased 
estimates of 6, but is less efficient than T(-), with a relative efficiency of e < 1. As noted by A. 
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N. Pettitt (personal communication), for the above example with the Gaussian kernel density 
for it e (y\x, 9), the likelihood-free approximation using T(-) becomes n LF (9\y) = N(0, 1/e + 
e 2 /3). The 1/e term can easily be greater than the e 2 /3 term, especially as practical interest 
is in small e. This example illustrates that inefficient statistics can often determine the 
quality of the posterior approximation, and that this approximation can remain poor even 
for e = 0. 

Accordingly, it is common in practice to aim to reduce e as low as is computationally 
feasible. However, in certain circumstances, it is not clear that doing so will result in a 



better approximation to Tc(9\y) than for a larger e. This point is illustrated in Section 1.4.4 



1.3 Likelihood-free MCMC samplers 



A Metropolis-Hastings sampler may be constructed to target the augmented likelihood-free 



posterior n LF (9, x\y) (given by 1.2.1) without directly evaluating the intractable likelihood 



(|Marjoram et al. 2003). Consider a proposal distribution for this sampler with the factori- 

q[{e,x),{e',x')] = q {e,e')K{x'\e'). 



sat ion 



That is, when at a current algorithm state (9,x), a new parameter vector 9' is drawn from 
a proposal distribution q(9,9'), and conditionally on 9' a proposed dataset x' is generated 
from the model x' ~ n(x\9'). Following standard arguments, to achieve a Markov chain 
with stationary distribution ttl F (9, x\y), we enforce the detailed-balance (time-reversibility) 
condition 

7c LF (9, x\y)P[(9, x), (9', x')} = n LF (9', x'\y)P[(9', x'), (9, x)} (1.3.1) 



where the Metropolis-Hastings transition probability is given by 



P[(0, x), (9\ x')] = q[(0, x), (9\ x')]a[(0, x), (9', x')]. 
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Table 1.2: The likelihood-free MCMC algorithm, generalised from Marjoram et al. (2003). 



LF-MCMC Algorithm 



1. Initialise (9o,xo) and e. Set t = 0. 
At step t: 

2. Generate 9' ~ q(9 t , 9) from a proposal distribution. 

3. Generate x' ~ ir{x\9') from the model given 9'. 

4. With probability min{l, set (0 t+1 , = (9', x>) 

otherwise set (9 t+1 ,x t +i) = {9 t ,x t ). 

5. Increment t — t + 1 and go to 2. 



The probability of accepting a move from to {9',x') within the Metropolis-Hastings 

framework is then given by min{l, a[(9, x), (9', x')]}, where 



a[(9,x),(9',x')} 



7T LF (9',x'\y)q[(9' 1 x'),(9,x)] 
7r LF (9,x\y)q[(9,x),(9',x')} 
n € {y\x', 9')tt(x'\9')tt(9') q{&, 9)n{x\9) 
n e {y\Xi 9)ti{x\9)k{9) q(9, 9')n(x'\9') 
7r e (y\x',0'M9')q(9',9) 



(1.3.2) 



ir e (y\x,9)<K{9)q(9,9') ' 

Note that the intractable likelihoods do not need to be evaluated in the acceptance proba- 



bility calculation (1.3.2), leaving a computationally tractable expression which can now be 



evaluated. Without loss of generality we may assume that min{l, a[(9', x'), (9, x)]} = 1, and 



hence the detailed-balance condition (1.3.1), is satisfied since 



7i LF (9,x\y)P[(9,x),(9',x')} 



tt lf (9, x\y)q[(9, x), (9', x')]a[{9, x), (9', x')} 
klf(9, x\y)q(6, 9')n(x>\9>)n e (y\x', 9')Tr(9')q(9' , 9) 

7r e (y\x,9)7r(9)q(9,9') 
7r e (y\x, 9)7t(x\9)7r(9)g(9, 9')7r(x'\9')7r t (y\^ , 0>{0'W , ' 

7r e (y\x,9)7r(9)q(9,9') 
n e (y\x', 9')n(x'\9')n(9')q(9', 9)ir{x\9) 



7r LF (9',x'\y)P[(9',x'),(9,x)). 
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The MCMC algorithm targetting Tix J p(9, x\y), adapted from Marjoram et al.| ( |2003 ), is 



listed in Table 1.2 The sampler generates the Markov chain sequence (8 t ,x t ) for t > 0, 



although in practice, it is only necessary to store the vectors of summary statistics T(x t ) 
and T(x') at any stage in the algorithm. This is particularly useful when the auxiliary 
datasets x t are large and complex. 

An interesting feature of this sampler is that its acceptance rate is directly related to 



the value of the true likelihood function n(y\9') at the proposed vector 6' (Sisson et al. 



2007). This is most obviously seen when using the uniform kernel weighting function (1.2.4) 



as proposed moves to (8',x') can only be accepted if p(T(x'),T(y)) < e, and this occurs 
with a probability in proportion to the likelihood. For low e values this can result in very 
low acceptance rates, particularly in the tails of the distribution, thereby affecting chain 



mixing in regions of low posterior density. See Section 1.4.5 for an illustration. However 



the LF-MCMC algorithm offers improved acceptance rates over rejection sampling-based 



likelihood-free algorithms (Marjoram et al. , 2003). 



We now examine a number of variations on the basic LF-MCMC algorithm which have 
been proposed either to improve sampler performance, or to examine model goodness-of-fit. 



1.3.1 Marginal space samplers 



Given the definition of TCip(9\y) in (1.2.2), an unbiased pointwise estimate of the marginal 



posterior distribution is available through Monte Carlo integration as 



(1.3.3) 



8=1 



where independent draws from the model n(x\0) ( |Marjoram et al.[|2003[|Peters 

et~aL| [20081 IRatmann eTa!| [20091 IReeves and Pettittj [20051 ISisson et~aL| [20071 |Toni et al 



2009 Wegmann et al. , 2009). This then permits an MCMC sampler to be constructed directly 



targetting the likelihood-free marginal posterior iiip(8\y). In this setting, the probability of 
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accepting a proposed move from 9 to 9' ~ q(6, 8') is given by min{l, a(9, 6')} where 

7r LF (9'\y)q(9', 9) J £ s ^(y|x' s , 5) 



a(9,9') 



Tr LF (9\y)q(9,9>) ' iE s ^i^M%(M') 



;i.3.4) 



1 5* 

where x ,...,x ~ tt(x\9'). As the Monte Carlo approximation (1.3.3) becomes more 
accurate as 5* increases, the performance and acceptance rate of the marginal likelihood-free 
sampler will gradually approach that of the equivalent standard MCMC sampler. 

However, the above ratio of two unbiased likelihood estimates is only unbiased as S — > oo. 
Hence, the above sampler will only approximately target 7Clf(9\i/) for large S, which makes 
it highly inefficient. However, note that estimating a(9, 9') with S = 1 exactly recovers 



(1.3.2), the acceptance probability of the MCMC algorithm targetting ttlf(9, x\y). That 
is, the marginal space likelihood-free sampler with S = 1 is precisely the likelihood-free 



MCMC sampler in Table 1.2. As the sampler targetting ttlf(9 } x\y) also provides unbiased 
estimates of the marginal tt LF{9\y), it follows that the likelihood- free sampler targetting 



ftLF(9\y) directly is also unbiased in practice (Sisson et al. , 2008). A similar argument for 
S > 1 can also be made, as outlined below. 

An alternative augmented likelihood-free posterior distribution is given by 



^LF(9,x 1 ..s\y) oc n e (y\x 1: s,8)ir(x 1: s\9)ir(9) 



1 S 

s=l 



\x s ,9) 



l[n(x s \9)] 



s=l 



n(9), 



where xi-s 



[x 



,x s ) represents s — 1, . . . ,S replicate auxiliary datasets x s ~ n(x\9). 



This posterior, generalised from Del Moral et al. (2008), is based on the more general ex- 



pected auxiliary variable approach of Andrieu et al. (2008), where the summation form 
of n t (y\xi : s, 9) describes this expectation. The resulting marginal posterior ^L F {9\y) = 
fys 7t~LF(9,xi;s,9\y)dxi;s is the same for all S, namely ^lf^v) — ^LF(9\y). 

The motivation for this form of posterior is that that a sampler targetting ttlf{9, Xi : s\y), 
for S > 1, will possess improved sampler performance compared to an equivalent sampler 
targetting itlf(9, x\y), through a reduction in the variability of the Metropolis-Hastings 
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acceptance probability. With the natural choice of proposal density given by 



q[(9, x 1:S ), (9', x' hS )] = q(9, 9') f] ir{x' s \9'), 



3=1 



where x' v! . 



\x 



X 



the acceptance probability of a Metropolis-Hastings algorithm 



targetting n lf{0 xi-.s\y) reduces to 



a[(9,xi :S ), (0',x' 1:S )} 



;i.3.5) 



This is the same acceptance probability (1.3.4) as a marginal likelihood-free sampler tar- 



getting n LF (9\y) directly, using S Monte Carlo draws to estimate ir LF (9\y) pointwise, via 



(1.3.3). Hence, both marginal and augmented likelihood-free samplers possess identical mix- 



ing and efficiency properties. The difference between the two is that the marginal sampler 



acceptance probability (1.3.4) is approximate for finite S, whereas the augmented sampler 



acceptance probability (1.3.5) is exact. However, clearly the marginal likelihood-free sampler 



is, in practice, unbiased for all S > 1. See Sisson et al. (2008) a for more detailed analysis. 



1.3.2 Error-distribution augmented samplers 

In all likelihood-free MCMC algorithms, low values of e result in slowly mixing chains through 
low acceptance rates. However, it also provides a potentially more accurate posterior approx- 
imation 7[Lp(9\y) « 7i(9\y). Conversely, MCMC samplers with larger e values may possess 
improved chain mixing and efficiency, although at the expense of a poorer posterior ap- 



proximation (e.g. Figure 1.1). Motivated by a desire for improved sampler efficiency while 



realising low e values, Bortot et al. (2007) proposed augmenting the likelihood-free posterior 



approximation to include e, so that 



7i LF (9,x,e\y) oc 7i e (y\x,9)n(x\9)7i(9)TT(e). 



Accordingly, e is treated as a tempering parameter in the manner of simulated tempering 
(Geyer and Thompson, 1995), with larger and smaller values respectively corresponding to 
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'hot" and "cold" tempered posterior distributions. The density 7r(e) is a pseudo-prior, which 



serves only to influence the mixing of the sampler through the tempered distributions. Bortot 



et al. (2007) suggested using a distribution which favours small e values for accuracy, while 
permitting large values to improve chain acceptance rates. The approximation to the true 
posterior 7i(9\y) is then given by 



^lf^Iv) = / / 7^LF(0,x,e\y)dxde 
Je Jy 



where e G £ C ]R + . Sampler performance aside, this approach permits an a posteriori 
evaluation of an appropriate value e = e* such that 7r£ F (£%) with £ = [0, e*] provides an 
acceptable approximation to ir(9\y). 



An alternative error-distribution augmented model was proposed by Ratmann et al. (2009) 
with the aim of diagnosing model mis-specification for the observed data y. For the vector 
of summary statistics T(x) = (Ti(x), . . . ,Tr(x)), the discrepancy between the model ir(x\6) 
and the observed data is given by r = (ti, . . . , tr), where r r = T r (x) — T r (y), for r = 1, . . . , R, 
is the error under the model in reproducing the r-th element of T(-). The joint distribution 
of model parameters and model errors is defined as 

7TL F (0,xi..s,r\y) oc ir e (y\r, x 1:S , 6)Ti(xi: S \0)'K(d)Ti(T) 

:= mm£ r (T r \y,x 1:S ,9)iT(x 1: s\9)ii(9)ii(T), (1.3.6) 

r 

where the univariate error distributions 

€r(T r \y,x 1: s,9) = — 2_^K [ I (1.3.7) 



8=1 



are constructed from smoothed kernel density estimates of model errors, estimated from S 
auxiliary datasets x 1 , . . . , x s , and where 7t(t) = Yi r ^^r)-, the joint prior distribution for the 
model errors, is centered on zero, reflecting that the model is assumed plausible a priori. The 
terms min r £ r (T r \y, x, 9) and 7r(r) take the place of the weighting function Tr e (y\r, Xi-.s, 9). The 
minimum of the univariate densities £, r { T r\y, 9) is taken over the R model errors to reflect 
the most conservative estimate of model adequacy, while also reducing the computation on 



14 



CHAPTER 1. LIKELIHOOD-FREE MCMC 



the multivariate r to its univariate component margins. The smoothing bandwidths e r of 
each summary statistic T r (-) are dynamically estimated during sampler implementation as 
twice the interquartile range of T r (x s ) — T r (y), given x 1 , . . . ,x s . 

Assessment of model adequacy can then be based on TT LF (r\y) = J e Jy S n LF (9, Xi : s, r\y)dxi : sd9, 
the posterior distribution of the model errors. If the model is adequately specified then 
^lf{t\u) should be centered on the zero vector. If this is not the case then the model is 
mis-specified. The nature of the departure of ^lf{^\v) from the origin e.g. via one or more 
summary statistics T r (-), may indicate the manner in which the model is deficient. See e.g. 



Wilkinson (2008) for further assessment of model errors in likelihood-free models. 



1.3.3 Potential alternative MCMC samplers 

Given the variety of MCMC techniques available for standard Bayesian inference, there are 
a number of currently unexplored ways in which these might be adapted to improve the 
performance of likelihood-free MCMC samplers. 



For example, within the class of marginal space samplers (Section 1.3.1), the number of 



Monte Carlo draws S determines the quality of the estimate of ttlf^u) ( c -f- 1-3.3). A 



standard implementation of the delayed-rejection algorithm (Tierney and Mira ; 1999) would 



permit rejected proposals based on poor but computationally cheap posterior estimates (i.e. 
using low-moderate S), to generate more accurate but computationally expensive second- 
stage proposals (using large S), thereby adapting the computational overheads of the sampler 
to the required performance. 

Alternatively, coupling two or more Markov chains targetting ttlf(8, x\y), each utilising 
a different e value, would achieve improved mixing in the "cold" distribution (i.e. the chain 
with the lowest e) through the switching of states between neighbouring (in an e sense) 
chains (Pettitt |2006 ). This could be particularly useful in multi-modal posteriors. While 
this flexibility is already available with continuously varying e in the augmented sampler 



targetting 7Tlf(0, x, e\y) (Bortot et al. (2007), Section 1.3.2), there are benefits to constructing 
samplers from multiple chain sample-paths. 
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Finally, likelihood-free MCMC samplers have to date focused on tempering distributions 
based on varying e. While not possible in all applications, there is clear scope for a class 
of algorithms based on tempering on the number of observed datapoints from which the 
summary statistics T(-) are calculated. Lower numbers of datapoints will produce greater 
variability in the summary statistics, in turn generating wider posteriors for the parameters 
9, but with lower computational overheads required to generate the auxiliary data x. 



1.4 A practical guide to likelihood-free MCMC 

In this Section we examine various practical aspects of likelihood-free computation under a 
simple worked analysis. For observed data y — (yi, ■ ■ ■ ,1/20) consider two candidate models: 
yi ~ Exponential A) and y% ~ Gamma(fc, if)), where model equivalence is obtained under 
k — 1, if> — 1/A. Suppose that the sample mean and standard deviation of y are available 
as summary statistics T(y) = (y, s y ) = (4, 1), and that interest is in fitting each model and 
in establishing model adequacy. Note that the summary statistics T(-) are sufficient for A 
but not for (k,if>), where they form moment-based estimators. For the following we consider 
flat priors tt(A) oc 1, 7r(k, if)) oc 1 for convenience. The true posterior distribution under the 
Exponential A) model is X\y ~ Gamma(21, 80). 



1.4.1 An exploratory analysis 



An initial exploratory investigation of model adequacy is illustrated in Figure L2 which 
presents scatterplots of summary statistics versus summary statistics, and summary statistics 
versus parameter values under each model. Images are based on 2000 parameter realisations 
\,k,if> ~ U(0, 20) followed by summary statistic generation under each model parameter. 
Horizontal and vertical lines denote the values of the observed summary statistics T(y). 



From the plots of sample means against standard deviations, T(y) is clearly better rep- 
resented by the Gamma than the Exponential model. The observed summary statistics (i.e. 
the intersection of horizontal and vertical lines) lie in regions of relatively lower prior predic- 
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Figure 1.2: Scatterplots of summary statistics T(x) = (x,s x ) and parameter values \,k,ip 
under both Exponential(A) and Gamma(fc, ip) models, based on 2000 realisations X,k,ip ~ 
£7(0,20). Horizontal and vertical lines denote observed summary statistics T(y) = (4,1). 

Circles denote the MLE of A = 1/y = 1/4 under the Exponential model. Crosses denote 
method of moments estimators k = y 1 j s 2 y = 16 and ip = s 2 y jy = 1/4 under the Gamma 
model. 
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tive density under the Exponential model, compared to the Gamma. That is, a priori, the 
statistics T(y) appear more probable under the more complex model. 

Consider the plots of A -1 versus T{x) under the Exponential model. The observed statis- 
tics T(y) individually impose competing requirements on the Exponential parameter. An 
observed sample mean of y — 4 indicates that A^ 1 is most likely in the approximate range 
[3,5] (indicated by those A -1 values where the horizontal line intersects with the density). 
However, the sample standard deviation s y — 1 independently suggests that A" 1 is most 
likely in the approximate range [0.5, 1.5]. If either x or s x were the only summary statistic, 
then only one of these ranges are appropriate, and the observed data would be considerably 
more likely under the Exponential model. However, the relative model fits and model ade- 
quacies of the Exponential and Gamma can only be evaluated by using the same summary 
statistics on each model. (Otherwise, the model with the smaller number of summary statis- 
tics will be considered the most likely model, simply because it is more probable to match 
fewer statistics.) As a result, the competing constraints on A through the statistics x and 
s y are so jointly improbable under the Exponential model that simulated and observed data 
will rarely coincide, making T(y) very unlikely under this model. This is a strong indicator 
of model inadequacy. 

In contrast, the plots of k and ip against T(x) under the Gamma model indicate no 
obvious restrictions on the parameters based on T(y), suggesting that this model is flexible 
enough to have generated the observed data with relatively high probability. Note that from 
these marginal scatterplots, it is not clear that these statistics are at all informative for the 
model parameters. This indicates the importance of parameterisation for visualisation, as 
alternatively considering method of moments estimators as summary statistics (k, ip), where 
k = x 2 1 s\ and ip = s 2 /x, will result in strong linear relationships between (k,ip) and (k,ip). 
Of course, in practice direct unbiased estimators are rarely known. 
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1.4.2 The effect of e 



We now implement the LF-MCMC algorithm (Table 1.2) targetting the Exponential(A) 
model, with an interest in evaluating sampler performance for different e values. Recall that 
small e is required to obtain a good likelihood-free approximation to the intractable posterior 



TTLF(@\y) ~ K(6\y) (see Figure 1.1), where now 9 = A. However, implementing the sampler 
with low e can be problematic in terms of initialising the chain and in achieving convergence 
to the stationary distribution. 

An initialisation problem may occur when using weighting functions 7r e (?/|x, 9) with com- 



pact support, such as the uniform kernel (1.2.4) defined on [— e, e]. Here, initial chain values 



9 ,x ) are required such that 7i e (y\x ,9 ) ^ in the denominator of the acceptance proba- 



bility at time t = 1 (Table 1.2). For small e, this is unlikely to be the case for the first such 
parameter vector tried. Two naive strategies are to either repeatedly generate xq ~ it(x\6o), 
or similarly repeatedly generate 9 ~ ti{9) and xo ~ ^(xl^o), until 7r e (y\xo, 9q) ^ is achieved. 
However, the former strategy may never terminate unless 9$ is located within a region of 
high posterior density. The latter strategy may never terminate if the prior is diffuse with 
respect to the posterior. Relatedly, Markov chain convergence can be very slow for small 
e when moving through regions of very low density, for which generating x' ~ ir(x\9') with 
T(x') ~ T(y) is highly improbable. 

One strategy to avoid these problems is to augment the target distribution from ttlf(9, x\y) 
to ttlf(9, x, e\y) ( |Bortot et al. 2007), permitting a time-variable e to improve chain mixing 



(see Section |L3 for discussion on this and other strategies to improve chain mixing). A 
simpler strategy is to implement a specified chain burn-in period, defined by a monotonic 
decreasing sequence e t +i < e t , initialised with large eo, for which e t = e remains constant at 
the desired level for t > t*, beyond some (possibly random) time t* (e.g. ?). For example, 
consider the linear sequence et = max{eo — ct, e} for some c > 0. However, the issue here is 
in determining the rate at which the sequence approaches the target e: if c is too large, then 
tt = e before (9t,Xt) has reached a region of high density; if c is too small, then the chain 
mixes well but is computationally expensive through a slow burn in. 
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One self-scaling option for the uniform weighting function (1.2.4) would be to define 



e = p(T(x ),T(y)), and given the proposed pair (9',x') at time t, propose a new e value as 

e" = max{e, min{e', e t _i}} (1-4.1) 

where e' = p(T(x'),T(y)) > is the distance between observed and simulated summary 
statistics. If the proposed pair (9',x r ) are accepted then set e t = e", else set e t = tt-i- That 
is, the proposed e" is dynamically defined as the smallest possible value that results in a non- 
zero weighting function 7c et (y\x', 6') in the numerator of the acceptance probability, without 
going below the target e, and while decreasing monotonically. If the proposed move to (6', x') 
is accepted, the value e" is accepted as the new state, else the previous value e t -i is retained. 
Similar approaches could be taken with non-uniform weighting functions ir e (y\x, 9). 



Four trace plots of Xt and et for the Exponential(A) model are illustrated in Figure 1.3 
(a,b), using the above procedure. All Markov chains were initialised at Ao = 10 with target 
e = 3, proposals were generated via A' ~ iV(A t _i, 1) and the distance measure 

p(T(x),T(y)) = {[T{x)-T{y)Y^[T{x)-T{y)]} l/2 (1.4.2) 

is given by Mahalanobis distance. The covariance matrix S = Cov(T(y)) is estimated by 
the sample covariance of 1000 summary vectors T{x) generated from 7r(x|A) conditional on 
A = 0.25 the maximum likelihood estimate. All four chains converge to the high density 
region at A = 0.25 quickly, although at different speeds as the sampler takes different routes 
through parameter space. Mixing during burn-in is variable between chains, although overall 
convergence to e t = 3 is rapid. The requirement of tuning the rate of convergence, beyond 
specifying the final tolerance e, is clearly circumvented. 



Figure 1.3 (c,d) also illustrates the performance of the LF-MCMC sampler, post-convergence, 



based on four chains of length 100,000, each with different target e. As expected (see dis- 



cussion in Section 1.3), smaller e results in lower acceptance rates. In Figure 1.3 (c), e = 4.5 
(bottom trace), 4, 3.5 and 3 (top) result in post-convergence (of e t ) mean acceptance rates 
of 12.2%, 6.1%, 2.9% and 1.1% respectively. Conversely, precision (and accuracy) of the 



posterior marginal distribution for A increases with decreasing e as seen in Figure |L3 (d). 
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Figure 1.3: Performance of the LF-MCMC sampler for the Exponential A) model. [Top 
plots] Tra ce plo ts of (a) At and (b) e t for four chains using the self-scaling {e t } sequence 
given by (1.4.1). The MLE of A is 0.25 and the target e is 3. [Bottom plots] (c) Jittered 
trace plots ot A t with different target e = 4.5 (bottom), 4, 3.5 and 3 (top), (d) Posterior 
density estimates of A for the same chains based on a chain length of 100,000 iterations. 
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In practice, a robust procedure to identify a suitable target e for the likelihood-free 



MCMC sampler is not yet available. Wegmann et al. (2009) implement the LF-MCMC 
algorithm with a large e value to enhance chain mixing, and then perform a regression- 



based adjustment (Beaumont et al. , 2002 Blum and Francois, 2009) to improve the final 



posterior approximation. Bortot et al. (2007) implement the LF-MCMC algorithm target- 



ting the augmented posterior 7r LF (9,x,e\y) (see Section 1.3.2), and examine the changes 
in K £ LF {9\y) = f £ Jy7iLF(^,x,e\y)dxde, with £ = [0, e*], for varying e*. The final choice of 
e* is the largest value for which reducing e* further produces no obvious improvement in 
the posterior approximation. This procedure may be repeated manually through repeated 



LF-MCMC sampler implementations at different fixed e values (Tanaka et al. 2006). Re- 
gardless, in practice e is often reduced as low as possible such that computation remains 
within acceptable limits. 



1.4.3 The effect of the weighting function 

The optimal form of kernel weighting function ir e (y\x,6) for a given analysis is unclear at 
present. While the uniform weighting function ( |1.2.4 ) is the most common in practice - 



indeed, many likelihood-free methods have this kernel written directly into the algorithm 
(sometimes implicitly) - it seems credible that alternative forms may offer improved poste- 
rior approximations for given computational overheads. Some support for this is available 
through recently observed links between the likelihood-free posterior approximation 7Tl^(0|?/) 



and non-parametric smoothing (Blum, 2009) 



Here we evaluate the effect of the weighting function TT e (y\x,9) on posterior accuracy 
under the Exponential(A) model, as measured by the one-sample Kolmogorov-Smirnov dis- 
tance between the likelihood-free posterior sample and the true Gamma(21,80) posterior. To 
provide fair comparisons, we evaluate posterior accuracy as a function of computational over- 
heads, measured by the mean post-convergence acceptance rate of the LF-MCMC sampler. 
The following results are based on posterior samples consisting of 1000 posterior realisations 
obtained by recording every 1000 i/l chain state, following a 10,000 iteration burn-in period. 
Figures are constructed by averaging the results of 25 sampler replications under identical 
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Figure 1.4: Performance of the LF-MCMC sampler for the Exponential(A) model under vary- 
ing kernel weighting functions: (a) Mahalanobis distance between T(x) and T(y) evaluated 
on uniform, Epanechnikov and triangle kernel functions; (b) Mahalanobis, scaled Euclidean 
and Euclidean distance between T(x) and T(y) evaluated on the uniform kernel function. 
Sampler performance is measured in terms of accuracy (y-axis: one-sample Kolmogorov- 
Smirnov test statistic evaluated between likelihood-free posterior sample and true posterior) 
versus computational overheads (x-axis: mean sampler acceptance probability). 



conditions, for a range of e values. 



Figure L4l (a) shows the effect of varying the form of the kernel weighting function based 



on the Mahalanobis distance (1.4.2). There appears little obvious difference in the accuracy 



of the posterior approximations in this example. However, it is credible to suspect that 



Blum 


(2009 


); 


Peters et al. 



(2008)). This is more clearly demonstrated in Section 1.4.5 The slight worsening in the 
accuracy of the posterior approximation, indicated by the upturn for low e in Figure 1.4| (a), 
will be examined in more detail in Section 11.4.41 



Regardless of its actual form, the weighting function 7r e (y\x, 9) should take the distribu- 
tion of the summary statistics T(-) into consideration. Fan et al.l (2010) note that using 



a Euclidean distance measure (given by (1.4.2) with S = /, the identity matrix) within 



(say) the uniform weighting function (1.2.4), ignores the scale and dependence (correlation) 
structure of T(-), accepting sampler moves if T(y) is within a circle of size e centered on 
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T(x), rather than within an ellipse defined by X = Cov(T(y)). In theory, the form of the 
distance measure does not matter as in the limit e — > any effect of the distance measure 
p is removed from the posterior iTLF(0\y) i.e. T(x) = T(y) regardless of the form of S. In 
practice however, with e > 0, the distance measure can have a strong effect on the quality 
of the likelihood-free posterior approximation tt lf (8\y) ~ ir(9\y). 



Using the uniform weighting function, Figure 1.4 (b) demonstrates the effect of using 



Mahalanobis distance (1.4.2), with E given by estimates of Cov(T(y)), diag(Cov(T(y))) 
(scaled Euclidean distance) and the identity matrix / (Euclidean distance). Clearly, for a 
fixed computational overhead (x-axis), greater accuracy is attainable by standardising and 
orthogonalising the summary statistics. In this sense, Mahalanobis distance represents an 
approximate standardisation of the distribution of T(y)\8 at an appropriate point 9 following 



indirect inference arguments (Jiang and Turnbull, 2004). As Cov(T(y)) may vary with 6 



Fan et al. 



(2010) suggest using an approximate MAP estimate of 6, so that 9 resides in a 



region of high posterior density. The assumption is then that Cov(T(y)) varies little over 
the region of high posterior density. 



1.4.4 The choice of summary statistics 



Likelihood- free computation is based on the reproduction of observed statistics T(y) under 
the model. If the T(y) are sufficient for 6, then the true posterior 7f(8\y) can be recovered 



exactly as e — > 0. If dim(T(y)) is large (e.g. Bortot et al. (2007)), then likelihood-free 



algorithms become computationally inefficient through the need to reproduce large numbers 



of summary statistics (Blum, 2009). However, low-dimensional, non-sufficient summary 



vectors produce less efficient estimators of 8, and so generate wider posterior distributions 



^LF(0\y) than using sufficient statistics (see Section 1.2.3). Ideally, low-dimensional and 
near-sufficient T(y) are the preferred option. 

Unfortunately, it is usually difficult to know which statistics are near-sufficient in practice. 
A brute-force strategy to address this issue is to repeat the analysis, while sequentially in- 
creasing the number of summary statistics each time (in order of their perceived importance), 
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Figure 1.5: Likelihood-free posterior accuracy of the Exponential A) model as a function of 
e for differing summary statistics: (a) T{y) = y; (b) T{y) = s y ; (c) T(y) = (y,s y ). Pos- 
terior accuracy (y-axis) is measured by one-sample Kolmogorov-Smirnov (KS) test statistic 
evaluated between likelihood-free posterior sample and true posterior. Points and vertical 
lines represent KS statistic means and ranges based on 25 sampler replicates at fixed e lev- 
els. Crosses in panel (b) denote KS statistic evaluated with respect to a Gamma(21,20) 
distribution. 



until no further changes to 717,^(6%) are observed (Marjoram et al. , 2003). See also Joyce 



and Marjoram (2008). If the extra statistics are iminformative, the quality of approximation 
will remain the same, but the sampler will be less efficient. However, simply enlarging the 
number of informative summary statistics is not necessarily the best way to improve the 
likelihood-free approximation T[Lp{0\y) ~ ir(6\y), and in fact may worsen the approximation 
in some cases. 

An example of this is provided by the present Exponential(A) model, where either of 
the two summary statistics T(y) = (y, s y ) = (4, 1) alone is informative for A (and indeed, 
y is sufficient), as we expect that A ~ 1/y ~ l/s y under any data generated from this 
model. In this respect, however, the observed values of the summary statistics provide 
conflicting information for the model parameter (see Section 1.4.1). Figure 1.5 examines 
the effect of this, by evaluating the accuracy of the likelihood-free posterior approximation 
TTLF{6\y) ~ ^i^\y) as a function of e under different summary statistic combinations. As 
before, posterior accuracy is measured via the one-sample Kolmogorov-Smirnov test statistic 
with respect to the true Gamma(21,80) posterior. 

With T(y) = y, panel (a) demonstrates that accuracy improves as e decreases, as expected. 
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For panel (b), with T(y) = s y (dots), the resulting n LF (9\y) posterior is clearly different 
from the true posterior for all e. Of course, the limiting posterior as e — > is (very) 
approximately Gamma(21,20), resulting from an Exponential model with A = l/s y = 1, 
rather than Gamma(21,80) resulting from an Exponential model with A = 1/y = 1/4. 
The crosses in panel (b) denote the Kolmogorov-Smirnov test statistic with respect to the 
Gamma(21,20) distribution, which indicates that 7Tlf(^|?/) is roughly consistent with this 
distribution as e decreases. That the Gamma(21,20) is not the exact limiting density (i.e. 
the KS statistic does not tend to zero as e — > 0) stems from the fact that s y is not a sufficient 
statistic for A, and is less then fully efficient. 

In panel (c) with T(y) = (y, s y ), which contains an exactly sufficient statistic (i.e. y), the 
accuracy of n LF (9\y) appears to improve with decreasing e, and then actually worsens before 
improving again. This would appear to go against the generally accepted principle, that for 
sufficient statistics, decreasing e will always improve the approximation -n^p (6\y) rj ir(9\y). 
Of course, the reality here is that both of these competing statistics are pulling the likelihood 
free posterior in different directions, with the consequence that the limiting posterior as 
e — > will be some combination of both Gamma distributions, rather than the presumed 
(and desired) Gamma(21,80). 

This observation leads to the uncomfortable conclusion that model comparison through 
likelihood-free posteriors with a fixed vector of summary statistics T(y), will ultimately 
compare distortions of those models which are overly simplified with respect to the true 
data generation process. This remains true even when using sufficient statistics and for 
e -> 0. 



1.4.5 Improving mixing 



Recall that the acceptance rate of the LF-MCMC algorithm (Table 1.2) is directly related 



to the value of the true likelihood n(y\9') at the proposed vector 9' (Section 1.3). While 



this is a necessary consequence of likelihood-free computation, it does imply poor sampler 
performance in regions of low probability, as the Markov chain sample-path may persist in 
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Figure 1.6: Aspects of LF-MCMC sampler performance: [Top plots] Trace plots of (a) k and 
(b) ip parameters under the Gamma model, for varying numbers of auxiliary datasets S = 1 
(lower traces), 10,25 and 50 (upper traces) using e = 2 and the uniform kernel function 
7i e (y\x, 6). [Bottom plots] Distribution of sojourn lengths of parameter k above (c) k = 45 
and (d) k = 50 for varying numbers of auxiliary datasets. Boxplot shading indicates uniform 
(white) or Gaussian (grey) kernel functions ir e (y\x,6). The Gaussian kernel sampler used 

e = 2/V3 to ensure a comparable standard deviation with the uniform kernel sampler. 



distributional tails for long periods of time due to low acceptance probabilities ( Sisson et al 



2007). An illustration of this shown in Figure 1.6| (a, b: lowest light grey lines), which 
displays the marginal sample paths of k and ip under the Gamma(fc, ip) model, based on 
5000 iterations of a sampler targetting 7r(9,x\y) with e = 2 and using the uniform kernel 
function 7c e (y\x,6). At around 1400 iterations the sampler becomes stuck in the tail of the 
posterior for the following 700 iterations, with very little meaningful movement. 

A simple strategy to improve sampler performance in this respect is to increase the number 
of auxiliary datasets S generated under the model, either by targetting the joint posterior 
TTLF(6,xi:s\y) or the marginal posterior iVLF(8\y) with S > 1 Monte Carlo draws (see Section 
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1.3.1). This approach will reduce the variability of the acceptance probability (1.3.4), and 



allow the Markov chain acceptance rate to approach that of a sampler targetting the true 



posterior ir(6\y). The trace plots in Figure 1.6 (a,b) (bottom to top) correspond to chains 



implementing S = 1,10,20 and 50 auxiliary dataset generations per likelihood evaluation. 
Visually, there is some suggestion that mixing is improved as S increases. Note however, 
that for any fixed S, the LF-MCMC sampler may still become stuck if the sampler explores 
sufficiently far into the distributional tail. 



Figure 1.6 (c,d) investigates this idea from an alternative perspective. Based on 2 million 
sampler iterations, the lengths of sojourns that the k parameter spent above a fixed threshold 
k were recorded. A sojourn length is defined as the consecutive number of iterations in which 
the parameter k remains above k. Intuitively, if likelihood-free samplers tend to persist in 
distributional tails, the length of the sojourns will be much larger for the worse performing 



samplers. Figure 1.6 (c,d) shows the distributions of sojourn lengths for samplers with 
S = 1, 10, 25 and 50 auxiliary datasets, with k = 45 (panel c) and k = 50 (panel d). Boxplot 
shading indicates use of the uniform (white) or Gaussian (grey) weighting function n € (y\x, 8). 

A number of points are immediately apparent. Firstly, chain mixing is poorer the further 
into the tails the sampler explores. This is illustrated by the increased scale of the sojourn 
lengths for k = 50 compared to k = 45. Secondly, increasing S by a small amount substan- 
tially reduces chain tail persistence. As 5* increases further, the Markov chain performance 
approaches that of a sampler directly targetting the true posterior tt (0\y), and so less per- 
formance gains are observed by increasing S beyond a certain point. Finally, there is strong 
evidence to suggest that LF-MCMC algorithms using weighting kernel functions ir e (y\x, 9) 
that do not generate large numbers of zero-valued likelihoods will possess superior perfor- 
mance to those which do. Here use of the Gaussian weighting kernel clearly outperforms the 
uniform kernel in all cases. In summary, it would appear that the choice of kernel weighting 
function ir e (8\y) has a larger impact on sampler performance than the number of auxiliary 
datasets S. 
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Figure 1.7: Margin al like lihood-free posterior distributions 71"^ (r|?/) of the error-distribution 
augmented model (1.3.6), under the Exponential (top plots) and Gamma (bottom plots) 
models. Plots are based on 50,000 sampler iterations. 
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In order to evaluate the adequacy of both Exponential and Gamma models in terms of their 
support for the observed data T(y) = (y, s y ), we fit the error-distribution augmented model 



(1.3.6 ) given by 

ttlf(#, %i:S, r\y) := mm£ r (r r \y, x 1:S , 9)ic(xi :3 \O)ic(0)ir{ 



as described in Section 1.3.2 (Ratmann et al. , 2009). The vector r = (ri,^) with r r 



T r (x) — T r (y) for r = 1, 2, describes the error under the model in reproducing the observed 
summary statistics T{y). The marginal likelihood-free posterior 7Tlf(t|?/) should be centered 
on the zero vector for models which can adequately account for the observed data. 



We follow Ratmann et al. (2009) in specifying K in (1.3.7) as a biweight (quartic) kernel 
with an adaptive bandwidth e r determined by twice the interquartile range of T r (x s ) — T r (y) 
given xi-s = {x 1 , . . . , x s ). The prior on the error r is determined as 7t(t) = Yl r 7r ( r r)? where 
7r(r r ) = exp(— \r r \/5 r )/(25 r ) with 8\ = 62 = 0.75 for both Exponential and Gamma models. 

Based on 50,000 sampler iterations using S = 50 auxiliary datasets, the resulting bivariate 



posterior 7Tii?(r|y) is illustrated in Figure 1.7 for both models. From these plots, the errors r 



under the Gamma model (bottom plots) are clearly centered on the origin, with 50% marginal 



high-density regions given by T\\y ~ [—0.51,0.53] and T2\y ~ [—0.44,0.22] (Ratmann et al 
2009). However for the Exponential model (top plots), while the marginal 50% high density 
regions T\\y ~ [—0.32,1.35] and T2\y ~ [—0.55,0.27] also both contain zero, there is some 
indication of model mis-specification as the joint posterior error distribution r\y is not fully 
centered on the zero vector. Based on this assessment, and recalling the discussion on the 



exploratory analysis in Section 1.4.1 the Gamma model would appear to provide a better 
overall fit to the observed data. 
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In the early 1990's, the introduction of accessible Markov chain Monte Carlo samplers pro- 
vided the catalyst for a rapid adoption of Bayesian methods and inference as credible tools in 
model-based research. Twenty years later, the demand for computational techniques capable 
of handling the types of models inspired by complex hypotheses has resulted in new classes 
of simulation-based inference, that are again expanding the applicability and relevance of 
the Bayesian paradigm to new levels. 

While the focus of the present article centers on Markov chain-based, likelihood-free 
simulation, alternative methods to obtain samples from n LF (9\y) have been developed, each 
with their own benefits and drawbacks. While MCMC-based samplers can be more efficient 
than rejection sampling algorithms, the tendency of sampler performance to degrade in 
regions of low posterior density (see Section 1.4.5 Sisson et al. ( 2007[ )) can be detrimental 
to sampler efficiency. One class of methods, based on the output of a rejection sampler with 
a high e value (for efficiency), uses standard multivariate regression methods to estimate 



the relationship between the summary statistics T(x) and parameter vectors 9 (Beaumont 



et al., 2002; Blum and Francois, 2009 Marjoram and Tavare 2006). The idea is then 



to approximately transform the sampled observations from (9,T(x)) to (9*,T(y)) so that 
the adjusted likelihood-free posterior iiLF(9,x\y) — > ttlf(9* ,y\y) ~ n{9\y) is an improved 
approximation. Further attempts to improve sampler efficiency over MCMC-based methods 
have resulted in the development of likelihood-free sequential Monte Carlo and sequential 



importance sampling algorithms (Beaumont et al. , 2009 Del Moral et al. , 2008 Peters et al 



2008 


Sisson et al. 


2007 


Toni et al. , 


2009) 



free sequential Monte Carlo approaches can outperform their MCMC counterparts (McKinley 



et al. 


2009 


Sisson et al. 


2007 



There remain many open research questions in likelihood-free Bayesian inference. These 
include how to select and incorporate the vectors of summary statistics T(-), how to perform 
posterior simulation in the most efficient manner, and which form of joint likelihood-free 
posterior models and kernel weighting functions admit the most effective marginal approxi- 
mation to the true posterior itLF{9\y) ~ n(9\y). Additionally, the links to existing bodies of 
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research, including non-parametrics (Blum, 2009) and indirect inference (Jiang and Turnbull 



2004), are at best poorly understood. 



Finally, there is an increasing trend towards using likelihood-free inference for model 



selection purposes (Grelaud et al. , 2009 Toni et al. 2009). While this is a natural extension 



of inference for individual models, the analysis in Section |1.4.4| urges caution and suggests 
that further research is needed into the effect of the likelihood-free approximation both 
within models and on the marginal likelihoods 7r LF (y) = f y 7i- LF (9\y)d9 upon which model 
comparison is based. 
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